First, I will analyze the dataset by analyzing each variable by visualization. We have 870 observations in total (employees).
I also convert the other variables to factor. I will work on the dataset data1 (<-data0).
data1 <- data0
factorcolumns1 <- c("Education", "EnvironmentSatisfaction", "JobLevel", "NumCompaniesWorked", "PercentSalaryHike", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
data1[,factorcolumns1] <- lapply(data1[,factorcolumns1], as.factor)
skim(data1)
| Name | data1 |
| Number of rows | 870 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| factor | 25 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| Education | 0 | 1 | FALSE | 5 | 3: 324, 4: 240, 2: 182, 1: 98 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| EnvironmentSatisfaction | 0 | 1 | FALSE | 4 | 4: 262, 3: 258, 2: 178, 1: 172 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobInvolvement | 0 | 1 | FALSE | 4 | 3: 514, 2: 228, 4: 81, 1: 47 |
| JobLevel | 0 | 1 | FALSE | 5 | 1: 329, 2: 312, 3: 132, 4: 60 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| JobSatisfaction | 0 | 1 | FALSE | 4 | 4: 271, 3: 254, 1: 179, 2: 166 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| NumCompaniesWorked | 0 | 1 | FALSE | 10 | 1: 320, 0: 111, 3: 91, 4: 85 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
| PercentSalaryHike | 0 | 1 | FALSE | 15 | 11: 126, 13: 123, 14: 120, 12: 119 |
| PerformanceRating | 0 | 1 | FALSE | 2 | 3: 738, 4: 132 |
| RelationshipSatisfaction | 0 | 1 | FALSE | 4 | 4: 264, 3: 261, 1: 174, 2: 171 |
| StockOptionLevel | 0 | 1 | FALSE | 4 | 0: 379, 1: 355, 2: 81, 3: 55 |
| TotalWorkingYears | 0 | 1 | FALSE | 39 | 10: 132, 6: 72, 8: 61, 9: 58 |
| TrainingTimesLastYear | 0 | 1 | FALSE | 7 | 2: 309, 3: 308, 5: 75, 4: 73 |
| WorkLifeBalance | 0 | 1 | FALSE | 4 | 3: 532, 2: 192, 4: 98, 1: 48 |
| YearsAtCompany | 0 | 1 | FALSE | 32 | 1: 107, 5: 105, 3: 85, 2: 77 |
| YearsInCurrentRole | 0 | 1 | FALSE | 19 | 2: 223, 0: 151, 7: 136, 3: 68 |
| YearsSinceLastPromotion | 0 | 1 | FALSE | 16 | 0: 342, 1: 214, 2: 94, 7: 41 |
| YearsWithCurrManager | 0 | 1 | FALSE | 17 | 2: 202, 0: 166, 7: 131, 3: 76 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.0 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.5 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.0 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.0 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.5 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.0 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
First, I will take a look at monthly income of employees by the following histogram.
x <- data1$MonthlyIncome
h<-hist(x, breaks=10, col="steelblue", xlab="Monthly Income",
main="Histogram with Normal Curve for Monthly Income")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
The following code will show the minimum and maximum salary of 870 employees in the dataset.
range(data1$MonthlyIncome)
## [1] 1081 19999
Then the Monthly Income is from 1081 USD to 19999 USD. By the histogram, it is right skewed. The question is that “Will we transform this variable before studying?”
By the histogram, we see most people have salary in the range [2000,4000] and second range is [4000,6000]. I will divide into 6 groups as follows: 1081 - <2000, 2000 - <4000, 4000 - <6000, 6000 - <10000, and from 10000 - <16000 and 16000 - <20000.
data1$IncomeGroup <- cut(data1$MonthlyIncome, c(0,2000,4000,6000,10000,16000,20000), labels = c("<$2000","$2000-$4000","$4000 - $6000","$6000-$10000","$10000-$16000","$16000-$200000"), include.lowest = TRUE)
By the following boxplot,
ggplot(data1, aes_string(x = "IncomeGroup", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Income Groups")+ xlab("Income Groups") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
and the following barplot,
ggplot(data1, aes_string("IncomeGroup")) + geom_bar(fill="steelblue") + xlab ("Monthly Income Group") + ylab("Number of Employees") +ggtitle("Monthly Income vs. Employee Groups")
we can see that:
The salary less than 2000 USD is considered as in lower class.
The salary from 2000 USD to 4000 USD is considered as in lower middle class and from 4000 USD to 6000 USD is considered as in middle class and from 6000 USD to 10000 USD is considered as in upper middle class.
The salary from 10000 USD to 16000 USD is considered as in the lower high class and from 16000 USD to 20000 USD is considered as in the high class.
Now I will study the relationship between Income Group and Attrition.
ggplot(data1, aes_string(x = "IncomeGroup", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Income Group") + ylab("Percent numbers of employees")
I will observe the atrrition data first to see the percentage of employees who left jobs.
stats <- function(df, x) {
df %>% group_by_at(x) %>%
summarise(Count = n(), Proportion = scales::percent(n()/dim(df)[1])) %>%
kable() %>% kable_styling(full_width = FALSE)
}
stats(data1, "Attrition")
| Attrition | Count | Proportion |
|---|---|---|
| No | 730 | 83.9% |
| Yes | 140 | 16.1% |
ggplot(data1, aes_string("Attrition")) + geom_bar(fill="steelblue") + xlab ("Attrition") + ylab("Number of Employees")
First, I will observe Ages of Employees in the dataset.
summary(data1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 30.00 35.00 36.83 43.00 60.00
We can also use the following code to see the range of ages.
range(data1$Age)
## [1] 18 60
Now I will take a look at Age variable by the following histogram.
x <- data1$Age
h<-hist(x, breaks=10, col="steelblue", xlab="Age",
main="Histogram with Normal Curve for Employee Ages")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
We will see the relationship between Age and Attrition by the following barplot.
ggplot(data1, aes_string(x = "Age", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Age") + ylab("Percent numbers of employees")
By the following Scatterplot, we can see the relationship between Age and Monthly Income.
ggplot(data1, aes_string(x = "Age", y = "MonthlyIncome")) + geom_point() + geom_smooth(method="lm")
By the histogram, I will divide Age into 4 groups: 18-25, 25-35, 35-45, 45-60.
data1$AgeGroup <- cut(data1$Age, c(18,25,35,45,60), labels = c("18-25","25-35","35-45","45-60"), include.lowest = TRUE)
Now I will study the relationship between Age groups and Attrition.
ggplot(data1, aes_string(x = "AgeGroup", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Age Groups") + ylab("Percentage of employees")
We can also see the relationship between Age Groups and Monthly Income here.
ggplot(data1, aes_string(x = "AgeGroup", y = "MonthlyIncome")) + geom_point() + geom_smooth(method="lm")
ggplot(data1, aes_string(x = "AgeGroup", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Age Groups")+ xlab("Age Groups") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
summary(data1$BusinessTravel)
## Non-Travel Travel_Frequently Travel_Rarely
## 94 158 618
stats(data1,"BusinessTravel")
| BusinessTravel | Count | Proportion |
|---|---|---|
| Non-Travel | 94 | 10.8% |
| Travel_Frequently | 158 | 18.2% |
| Travel_Rarely | 618 | 71.0% |
Now I will study the relationship between Bussiness Travel and Age.
ggplot(data1, aes_string(x = "AgeGroup", fill = "BusinessTravel")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Age Groups") + ylab("Business Travel")
Now we will see the relationship between Business Travel and Attrition.
ggplot(data1, aes_string(x = "BusinessTravel", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Business Travel") + ylab("Attrition")
ggplot(data1, aes_string(x = "BusinessTravel", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Business Travel Groups")+ xlab("Business Travel") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
summary(data1$Department)
## Human Resources Research & Development Sales
## 35 562 273
stats(data1, "Department")
| Department | Count | Proportion |
|---|---|---|
| Human Resources | 35 | 4.02% |
| Research & Development | 562 | 64.6% |
| Sales | 273 | 31.4% |
ggplot(data1, aes_string(x = "Department", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Department") + ylab("Attrition")
ggplot(data1, aes_string(x = "Department", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Department")+ xlab("Department") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
The following histogram will show us the distance from home of Employees.
x <- data1$DistanceFromHome
h<-hist(x, breaks=10, col="steelblue", xlab="Distance from Home",
main="Distance from Home vs. Number of Employees")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
ggplot(data1, aes_string(x = "DistanceFromHome", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Distance from Home") + ylab("Attrition")
stats(data1, "Education")
| Education | Count | Proportion |
|---|---|---|
| 1 | 98 | 11.3% |
| 2 | 182 | 20.9% |
| 3 | 324 | 37.2% |
| 4 | 240 | 27.6% |
| 5 | 26 | 2.99% |
ggplot(data1, aes_string(x = "Education", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Education") + ylab("Attrition")
ggplot(data1, aes_string(x = "Education", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Education")+ xlab("Education") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "EducationField")
| EducationField | Count | Proportion |
|---|---|---|
| Human Resources | 15 | 1.72% |
| Life Sciences | 358 | 41.1% |
| Marketing | 100 | 11.5% |
| Medical | 270 | 31.0% |
| Other | 52 | 5.98% |
| Technical Degree | 75 | 8.62% |
ggplot(data1, aes_string(x = "EducationField", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Education Field") + ylab("Attrition")
ggplot(data1, aes_string(x = "EducationField", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Education Field")+ xlab("Education Field") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "EnvironmentSatisfaction")
| EnvironmentSatisfaction | Count | Proportion |
|---|---|---|
| 1 | 172 | 19.8% |
| 2 | 178 | 20.5% |
| 3 | 258 | 29.7% |
| 4 | 262 | 30.1% |
ggplot(data1, aes_string(x = "EnvironmentSatisfaction", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Environment Satisfaction") + ylab("Attrition")
ggplot(data1, aes_string(x = "EnvironmentSatisfaction", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Environment Satisfaction")+ xlab("Environment Satisfaction") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "Gender")
| Gender | Count | Proportion |
|---|---|---|
| Female | 354 | 40.7% |
| Male | 516 | 59.3% |
ggplot(data1, aes_string(x = "Gender", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Gender") + ylab("Attrition")
ggplot(data1, aes_string(x = "Gender", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Gender")+ xlab("Gender") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "JobInvolvement")
| JobInvolvement | Count | Proportion |
|---|---|---|
| 1 | 47 | 5.40% |
| 2 | 228 | 26.2% |
| 3 | 514 | 59.1% |
| 4 | 81 | 9.31% |
ggplot(data1, aes_string(x = "JobInvolvement", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Job Involvement") + ylab("Attrition")
ggplot(data1, aes_string(x = "JobInvolvement", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Involvement")+ xlab("Job Involvement") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "JobLevel")
| JobLevel | Count | Proportion |
|---|---|---|
| 1 | 329 | 37.8% |
| 2 | 312 | 35.9% |
| 3 | 132 | 15.2% |
| 4 | 60 | 6.90% |
| 5 | 37 | 4.25% |
ggplot(data1, aes_string(x = "JobLevel", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Job Level") + ylab("Attrition")
ggplot(data1, aes_string(x = "JobLevel", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Level")+ xlab("Job Level") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "JobRole")
| JobRole | Count | Proportion |
|---|---|---|
| Healthcare Representative | 76 | 8.74% |
| Human Resources | 27 | 3.10% |
| Laboratory Technician | 153 | 17.6% |
| Manager | 51 | 5.86% |
| Manufacturing Director | 87 | 10.0% |
| Research Director | 51 | 5.86% |
| Research Scientist | 172 | 19.8% |
| Sales Executive | 200 | 23.0% |
| Sales Representative | 53 | 6.09% |
ggplot(data1, aes_string(x = "JobRole", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Job Role") + ylab("Attrition")
ggplot(data1, aes_string(x = "JobRole", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Role")+ xlab("Job Role") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "JobSatisfaction")
| JobSatisfaction | Count | Proportion |
|---|---|---|
| 1 | 179 | 20.6% |
| 2 | 166 | 19.1% |
| 3 | 254 | 29.2% |
| 4 | 271 | 31.1% |
ggplot(data1, aes_string(x = "JobSatisfaction", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Job Satisfaction") + ylab("Attrition")
ggplot(data1, aes_string(x = "JobSatisfaction", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Satisfaction")+ xlab("Job Satisfaction") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "MaritalStatus")
| MaritalStatus | Count | Proportion |
|---|---|---|
| Divorced | 191 | 22.0% |
| Married | 410 | 47.1% |
| Single | 269 | 30.9% |
ggplot(data1, aes_string(x = "MaritalStatus", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Marital Status") + ylab("Attrition")
ggplot(data1, aes_string(x = "MaritalStatus", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Marital Status")+ xlab("Marital Status") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "NumCompaniesWorked")
| NumCompaniesWorked | Count | Proportion |
|---|---|---|
| 0 | 111 | 12.8% |
| 1 | 320 | 36.8% |
| 2 | 74 | 8.51% |
| 3 | 91 | 10.5% |
| 4 | 85 | 9.77% |
| 5 | 43 | 4.94% |
| 6 | 39 | 4.48% |
| 7 | 46 | 5.29% |
| 8 | 28 | 3.22% |
| 9 | 33 | 3.79% |
ggplot(data1, aes_string(x = "NumCompaniesWorked", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Number of Companies Worked") + ylab("Attrition")
ggplot(data1, aes_string(x = "NumCompaniesWorked", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Number of Companies Worked")+ xlab("Number of Companies Worked") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "OverTime")
| OverTime | Count | Proportion |
|---|---|---|
| No | 618 | 71.0% |
| Yes | 252 | 29.0% |
ggplot(data1, aes_string(x = "OverTime", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Overtime") + ylab("Attrition")
ggplot(data1, aes_string(x = "OverTime", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Overtime")+ xlab("Overtime") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "PercentSalaryHike")
| PercentSalaryHike | Count | Proportion |
|---|---|---|
| 11 | 126 | 14.5% |
| 12 | 119 | 13.7% |
| 13 | 123 | 14.1% |
| 14 | 120 | 13.8% |
| 15 | 54 | 6.21% |
| 16 | 43 | 4.94% |
| 17 | 56 | 6.44% |
| 18 | 57 | 6.55% |
| 19 | 40 | 4.60% |
| 20 | 27 | 3.10% |
| 21 | 33 | 3.79% |
| 22 | 30 | 3.45% |
| 23 | 17 | 1.95% |
| 24 | 14 | 1.61% |
| 25 | 11 | 1.26% |
ggplot(data1, aes_string(x = "PercentSalaryHike", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Percent Salary Hike") + ylab("Attrition")
ggplot(data1, aes_string(x = "PercentSalaryHike", y = "MonthlyIncome", fill = x)) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Percent Salary Hike")+ xlab("Percent Salary Hike") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "PerformanceRating")
| PerformanceRating | Count | Proportion |
|---|---|---|
| 3 | 738 | 84.8% |
| 4 | 132 | 15.2% |
ggplot(data1, aes_string(x = "PerformanceRating", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Performance Rating") + ylab("Attrition")
ggplot(data1, aes_string(x = "PerformanceRating", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Performance Rating")+ xlab("Performance Rating") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "RelationshipSatisfaction")
| RelationshipSatisfaction | Count | Proportion |
|---|---|---|
| 1 | 174 | 20.0% |
| 2 | 171 | 19.7% |
| 3 | 261 | 30.0% |
| 4 | 264 | 30.3% |
ggplot(data1, aes_string(x = "RelationshipSatisfaction", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Relationship Satisfaction") + ylab("Attrition")
ggplot(data1, aes_string(x = "RelationshipSatisfaction", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Relationship Satisfaction")+ xlab("Relationship Satisfaction") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "StockOptionLevel")
| StockOptionLevel | Count | Proportion |
|---|---|---|
| 0 | 379 | 43.6% |
| 1 | 355 | 40.8% |
| 2 | 81 | 9.31% |
| 3 | 55 | 6.32% |
ggplot(data1, aes_string(x = "StockOptionLevel", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Stock Option Level") + ylab("Attrition")
ggplot(data1, aes_string(x = "StockOptionLevel", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Stock Option Level")+ xlab("Stock Option Level") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "TotalWorkingYears")
| TotalWorkingYears | Count | Proportion |
|---|---|---|
| 0 | 7 | 0.805% |
| 1 | 48 | 5.52% |
| 2 | 20 | 2.30% |
| 3 | 27 | 3.10% |
| 4 | 35 | 4.02% |
| 5 | 49 | 5.63% |
| 6 | 72 | 8.28% |
| 7 | 47 | 5.40% |
| 8 | 61 | 7.01% |
| 9 | 58 | 6.67% |
| 10 | 132 | 15.2% |
| 11 | 20 | 2.30% |
| 12 | 23 | 2.64% |
| 13 | 25 | 2.87% |
| 14 | 22 | 2.53% |
| 15 | 29 | 3.33% |
| 16 | 24 | 2.76% |
| 17 | 17 | 1.95% |
| 18 | 13 | 1.49% |
| 19 | 13 | 1.49% |
| 20 | 17 | 1.95% |
| 21 | 17 | 1.95% |
| 22 | 14 | 1.61% |
| 23 | 14 | 1.61% |
| 24 | 9 | 1.03% |
| 25 | 6 | 0.690% |
| 26 | 8 | 0.920% |
| 27 | 3 | 0.345% |
| 28 | 6 | 0.690% |
| 29 | 4 | 0.460% |
| 30 | 5 | 0.575% |
| 31 | 6 | 0.690% |
| 32 | 6 | 0.690% |
| 33 | 2 | 0.230% |
| 34 | 3 | 0.345% |
| 35 | 2 | 0.230% |
| 36 | 4 | 0.460% |
| 37 | 1 | 0.115% |
| 40 | 1 | 0.115% |
ggplot(data1, aes_string(x = "TotalWorkingYears", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Total Working Years") + ylab("Attrition")
ggplot(data1, aes_string(x = "TotalWorkingYears", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Total Working Years")+ xlab("Total Working Years") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "TrainingTimesLastYear")
| TrainingTimesLastYear | Count | Proportion |
|---|---|---|
| 0 | 30 | 3.45% |
| 1 | 39 | 4.48% |
| 2 | 309 | 35.5% |
| 3 | 308 | 35.4% |
| 4 | 73 | 8.39% |
| 5 | 75 | 8.62% |
| 6 | 36 | 4.14% |
ggplot(data1, aes_string(x = "TrainingTimesLastYear", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Training Times Last Year") + ylab("Attrition")
ggplot(data1, aes_string(x = "TrainingTimesLastYear", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Training Times Last Year")+ xlab("Training Times Last Years") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "WorkLifeBalance")
| WorkLifeBalance | Count | Proportion |
|---|---|---|
| 1 | 48 | 5.52% |
| 2 | 192 | 22.1% |
| 3 | 532 | 61.1% |
| 4 | 98 | 11.3% |
ggplot(data1, aes_string(x = "WorkLifeBalance", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Work Life Balance") + ylab("Attrition")
ggplot(data1, aes_string(x = "WorkLifeBalance", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Work Life Balance")+ xlab("Work Life Balance") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "YearsAtCompany")
| YearsAtCompany | Count | Proportion |
|---|---|---|
| 0 | 28 | 3.22% |
| 1 | 107 | 12.3% |
| 2 | 77 | 8.85% |
| 3 | 85 | 9.77% |
| 4 | 54 | 6.21% |
| 5 | 105 | 12.1% |
| 6 | 44 | 5.06% |
| 7 | 42 | 4.83% |
| 8 | 50 | 5.75% |
| 9 | 52 | 5.98% |
| 10 | 77 | 8.85% |
| 11 | 17 | 1.95% |
| 12 | 9 | 1.03% |
| 13 | 20 | 2.30% |
| 14 | 14 | 1.61% |
| 15 | 11 | 1.26% |
| 16 | 6 | 0.690% |
| 17 | 5 | 0.575% |
| 18 | 8 | 0.920% |
| 19 | 8 | 0.920% |
| 20 | 14 | 1.61% |
| 21 | 6 | 0.690% |
| 22 | 12 | 1.38% |
| 23 | 1 | 0.115% |
| 24 | 4 | 0.460% |
| 25 | 2 | 0.230% |
| 26 | 3 | 0.345% |
| 30 | 1 | 0.115% |
| 31 | 2 | 0.230% |
| 32 | 3 | 0.345% |
| 33 | 2 | 0.230% |
| 40 | 1 | 0.115% |
ggplot(data1, aes_string(x = "YearsAtCompany", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("Years At Company") + ylab("Attrition")
ggplot(data1, aes_string(x = "YearsAtCompany", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Years At Company")+ xlab("Years At Company") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "YearsInCurrentRole")
| YearsInCurrentRole | Count | Proportion |
|---|---|---|
| 0 | 151 | 17.4% |
| 1 | 38 | 4.37% |
| 2 | 223 | 25.6% |
| 3 | 68 | 7.82% |
| 4 | 53 | 6.09% |
| 5 | 26 | 2.99% |
| 6 | 17 | 1.95% |
| 7 | 136 | 15.6% |
| 8 | 56 | 6.44% |
| 9 | 40 | 4.60% |
| 10 | 14 | 1.61% |
| 11 | 15 | 1.72% |
| 12 | 7 | 0.805% |
| 13 | 9 | 1.03% |
| 14 | 7 | 0.805% |
| 15 | 3 | 0.345% |
| 16 | 3 | 0.345% |
| 17 | 3 | 0.345% |
| 18 | 1 | 0.115% |
ggplot(data1, aes_string(x = "YearsInCurrentRole", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("YearsInCurrentRole") + ylab("Attrition")
ggplot(data1, aes_string(x = "YearsInCurrentRole", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. YearsInCurrentRole")+ xlab("YearsInCurrentRole") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "YearsSinceLastPromotion")
| YearsSinceLastPromotion | Count | Proportion |
|---|---|---|
| 0 | 342 | 39.3% |
| 1 | 214 | 24.6% |
| 2 | 94 | 10.8% |
| 3 | 32 | 3.68% |
| 4 | 32 | 3.68% |
| 5 | 30 | 3.45% |
| 6 | 23 | 2.64% |
| 7 | 41 | 4.71% |
| 8 | 12 | 1.38% |
| 9 | 9 | 1.03% |
| 10 | 4 | 0.460% |
| 11 | 14 | 1.61% |
| 12 | 5 | 0.575% |
| 13 | 5 | 0.575% |
| 14 | 5 | 0.575% |
| 15 | 8 | 0.920% |
ggplot(data1, aes_string(x = "YearsSinceLastPromotion", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("YearsSinceLastPromotion") + ylab("Attrition")
ggplot(data1, aes_string(x = "YearsSinceLastPromotion", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. YearsSinceLastPromotion")+ xlab("YearsSinceLastPromotion") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
stats(data1, "YearsWithCurrManager")
| YearsWithCurrManager | Count | Proportion |
|---|---|---|
| 0 | 166 | 19.1% |
| 1 | 40 | 4.60% |
| 2 | 202 | 23.2% |
| 3 | 76 | 8.74% |
| 4 | 51 | 5.86% |
| 5 | 22 | 2.53% |
| 6 | 12 | 1.38% |
| 7 | 131 | 15.1% |
| 8 | 68 | 7.82% |
| 9 | 44 | 5.06% |
| 10 | 18 | 2.07% |
| 11 | 11 | 1.26% |
| 12 | 13 | 1.49% |
| 13 | 7 | 0.805% |
| 14 | 4 | 0.460% |
| 15 | 1 | 0.115% |
| 17 | 4 | 0.460% |
ggplot(data1, aes_string(x = "YearsWithCurrManager", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16) + xlab("YearsWithCurrManager") + ylab("Attrition")
ggplot(data1, aes_string(x = "YearsWithCurrManager", y = "MonthlyIncome")) +
geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Distribution between Monthly Income vs. YearsWithCurrManager")+ xlab("YearsWithCurrManager") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
There are 4 similar variables HourlyRate, DailyRate, MonthlyRate, MonthlyIncome. I will see the relationship of these 3 first variables with MonthlyIncome and these 4 variables with Attrition.
ggpairs(data = data1,
mapping = aes(color = Attrition),
columns = c("HourlyRate","DailyRate","MonthlyRate","MonthlyIncome"))
Then I will not consider these variables in the future analysis.
There are 3 variables related to Satisfaction level: EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction.
First, I will change these variables from factor to numeric.
numcolumns <- c("EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction")
data1[,numcolumns] <- lapply(data1[,numcolumns], as.numeric)
data1$Satisfaction <- as.factor(round((data1$EnvironmentSatisfaction +data1$JobSatisfaction+
data1$RelationshipSatisfaction)/3))
stats(data1, "Satisfaction")
| Satisfaction | Count | Proportion |
|---|---|---|
| 1 | 24 | 2.76% |
| 2 | 296 | 34.0% |
| 3 | 455 | 52.3% |
| 4 | 95 | 10.9% |
ggplot(data1, aes_string(x = "Satisfaction", fill = "Attrition")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16)
ggplot(data1, aes_string(x = "Satisfaction", fill = "IncomeGroup")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
geom_abline(slope = 0, intercept = .16)
After the first analyses on each variable,I will create a new dataset as follows.
data2 <- select(data0, -c("HourlyRate","DailyRate", "MonthlyRate","PerformanceRating"))
skim(data2)
| Name | data2 |
| Number of rows | 870 |
| Number of columns | 27 |
| _______________________ | |
| Column type frequency: | |
| factor | 12 |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobInvolvement | 0 | 1 | FALSE | 4 | 3: 514, 2: 228, 4: 81, 1: 47 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| JobSatisfaction | 0 | 1 | FALSE | 4 | 4: 271, 3: 254, 1: 179, 2: 166 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
| RelationshipSatisfaction | 0 | 1 | FALSE | 4 | 4: 264, 3: 261, 1: 174, 2: 171 |
| WorkLifeBalance | 0 | 1 | FALSE | 4 | 3: 532, 2: 192, 4: 98, 1: 48 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.0 | 35.0 | 43 | 60 | ▂▇▇▃▂ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.0 | 7.0 | 14 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.0 | 3.0 | 4 | 5 | ▂▅▇▆▁ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.0 | 3.0 | 4 | 4 | ▅▆▁▇▇ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.0 | 2.0 | 3 | 5 | ▇▇▃▂▁ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.5 | 4945.5 | 8182 | 19999 | ▇▅▂▁▁ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.0 | 2.0 | 4 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.0 | 14.0 | 18 | 25 | ▇▅▃▂▁ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.0 | 1.0 | 1 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.0 | 10.0 | 15 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.0 | 3.0 | 3 | 6 | ▂▇▇▂▃ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.0 | 5.0 | 10 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.0 | 3.0 | 7 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.0 | 1.0 | 3 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.0 | 3.0 | 7 | 17 | ▇▂▅▁▁ |